Predicting voter behavior is difficult for many reasons, despite the tremendous effort that goes into collecting, analyzing, and understanding the many available datasets. For our final project, we will analyze the 2016 presidential election dataset, but first, some background.

Background

The presidential election in 2012 did not come as a surprise. Some, including Nate Silver, correctly predicted the outcome, and many speculated about his approach.

Despite the success in 2012, the 2016 presidential election came as a big surprise to many; it was a clear example that even state-of-the-art forecasting methods can be caught off guard.

Answer the following questions in one paragraph for each.

  1. What makes voter behavior prediction (and thus election forecasting) a hard problem?

    • It’s hard to predict behavior because behavior can change at any time. Up until election day, candidates can do many things that change how the public views them. All of that uncertainty makes election forecasting a difficult problem.
  2. What was unique to Nate Silver’s approach in 2012 that allowed him to achieve good predictions?

    • While others used a weighted average of pollsters, they failed to account for pollster bias. If a pollster favored one candidate, it was likely to give that candidate increasingly better odds as time progressed. Nate Silver recognized this and was able to formulate a way to weight this bias as well. This was essentially a double-layered approach to weighted averages that reduced error even further.
  3. What went wrong in 2016? What do you think should be done to make future predictions better?

    • Individual polls were wrong and predicted many states incorrectly. National polls were wrong in the same direction, exacerbating the errors. Overall, the errors were spread unevenly, and the polls underestimated Trump’s chance of victory in many crucial states. I believe there needs to be a better way of determining which polling results are more trustworthy, which would allow data scientists to weight each poll more accurately and reduce the error.
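The double-layered weighting idea from question 2 can be sketched on toy numbers. This is a minimal illustration, not Silver's actual model: the pollster names, shares, house effects, and reliability scores below are all made up, whereas in practice the last two would be estimated from pollsters' track records.

```r
polls <- data.frame(
  pollster   = c("A", "A", "B", "B", "C"),
  cand_share = c(0.52, 0.51, 0.48, 0.47, 0.50),  # one candidate's share
  n          = c(800, 1200, 600, 900, 1000)      # poll sample sizes
)

# Layer 1: sample-size-weighted share within each pollster
share <- tapply(polls$cand_share * polls$n, polls$pollster, sum) /
         tapply(polls$n, polls$pollster, sum)

# Layer 2: subtract an assumed per-pollster house effect, then take a
# reliability-weighted average across pollsters
house_effect <- c(A = 0.01, B = -0.01, C = 0.00)
reliability  <- c(A = 0.9,  B = 0.7,  C = 1.0)
adj <- share - house_effect[names(share)]
estimate <- weighted.mean(adj, reliability[names(share)])
round(estimate, 3)  # 0.497
```

The point of the second layer is that a biased but precise pollster is down-weighted and re-centered rather than discarded.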

Data

election.raw = read.csv("data/election/election.csv") %>% as.tbl
census_meta = read.csv("data/census/metadata.csv", sep = ";") %>% as.tbl
census = read.csv("data/census/census.csv") %>% as.tbl
census$CensusTract = as.factor(census$CensusTract)

Election data

Following are the first few rows of the election.raw data:

county fips candidate state votes
NA US Donald Trump US 62984825
NA US Hillary Clinton US 65853516
NA US Gary Johnson US 4489221
NA US Jill Stein US 1429596
NA US Evan McMullin US 510002
NA US Darrell Castle US 186545

The meaning of each column in election.raw is clear except fips. The acronym is short for Federal Information Processing Standard.

In our dataset, fips values denote the area (US, state, or county) that each row of data represents: i.e., some rows in election.raw are summary rows. These rows have a county value of NA. There are two kinds of summary rows:

  • Federal-level summary rows have fips value of US.
  • State-level summary rows have the name of each state as the fips value.

Census data

Following are the first few rows of the census data:

CensusTract State County TotalPop Men Women Hispanic White Black Native Asian Pacific Citizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office Construction Production Drive Carpool Transit Walk OtherTransp WorkAtHome MeanCommute Employed PrivateWork PublicWork SelfEmployed FamilyWork Unemployment
1001020100 Alabama Autauga 1948 940 1008 0.9 87.4 7.7 0.3 0.6 0.0 1503 61838 11900 25713 4548 8.1 8.4 34.7 17.0 21.3 11.9 15.2 90.2 4.8 0 0.5 2.3 2.1 25.0 943 77.1 18.3 4.6 0 5.4
1001020200 Alabama Autauga 2156 1059 1097 0.8 40.4 53.3 0.0 2.3 0.0 1662 32303 13538 18021 2474 25.5 40.3 22.3 24.7 21.5 9.4 22.0 86.3 13.1 0 0.0 0.7 0.0 23.4 753 77.0 16.9 6.1 0 13.3
1001020300 Alabama Autauga 2968 1364 1604 0.0 74.5 18.6 0.5 1.4 0.3 2335 44922 5629 20689 2817 12.7 19.7 31.4 24.9 22.1 9.2 12.4 94.8 2.8 0 0.0 0.0 2.5 19.6 1373 64.1 23.6 12.3 0 6.2
1001020400 Alabama Autauga 4423 2172 2251 10.5 82.8 3.7 1.6 0.0 0.0 3306 54329 7003 24125 2870 2.1 1.6 27.0 20.8 27.0 8.7 16.4 86.6 9.1 0 0.0 2.6 1.6 25.3 1782 75.7 21.2 3.1 0 10.8
1001020500 Alabama Autauga 10763 4922 5841 0.7 68.5 24.8 0.0 3.8 0.0 7666 51965 6935 27526 2813 11.4 17.5 49.6 14.2 18.2 2.1 15.8 88.0 10.5 0 0.0 0.6 0.9 24.8 5037 67.1 27.6 5.3 0 4.2
1001020600 Alabama Autauga 3851 1787 2064 13.1 72.9 11.9 0.0 0.0 0.0 2642 63092 9585 30480 7550 14.4 21.9 24.2 17.5 35.4 7.9 14.9 82.7 6.9 0 0.0 6.0 4.5 19.8 1560 79.4 14.7 5.8 0 10.9

Census data: column metadata

Column information is given in metadata.

CensusTract Census.tract.ID numeric
State State, DC, or Puerto Rico string
County County or county equivalent string
TotalPop Total population numeric
Men Number of men numeric
Women Number of women numeric
Hispanic % of population that is Hispanic/Latino numeric
White % of population that is white numeric
Black % of population that is black numeric
Native % of population that is Native American or Native Alaskan numeric
Asian % of population that is Asian numeric
Pacific % of population that is Native Hawaiian or Pacific Islander numeric
Citizen Number of citizens numeric
Income Median household income ($) numeric
IncomeErr Median household income error ($) numeric
IncomePerCap Income per capita ($) numeric
IncomePerCapErr Income per capita error ($) numeric
Poverty % under poverty level numeric
ChildPoverty % of children under poverty level numeric
Professional % employed in management, business, science, and arts numeric
Service % employed in service jobs numeric
Office % employed in sales and office jobs numeric
Construction % employed in natural resources, construction, and maintenance numeric
Production % employed in production, transportation, and material movement numeric
Drive % commuting alone in a car, van, or truck numeric
Carpool % carpooling in a car, van, or truck numeric
Transit % commuting on public transportation numeric
Walk % walking to work numeric
OtherTransp % commuting via other means numeric
WorkAtHome % working at home numeric
MeanCommute Mean commute time (minutes) numeric
Employed % employed (16+) numeric
PrivateWork % employed in private industry numeric
PublicWork % employed in public jobs numeric
SelfEmployed % self-employed numeric
FamilyWork % in unpaid family work numeric
Unemployment % unemployed numeric

Data wrangling

  1. Remove summary rows from election.raw data: i.e.,

    • Federal-level summary rows go into election_federal.

    • State-level summary rows go into election_state.

    • Only county-level data remains in election.

    # Some data in election.raw is not encoded correctly or is missing
    # information. Specifically, FIPS 46102 of South Dakota has NA under county, so
    # we need to be sure to include it in the county-level data. FIPS 2000 of
    # Alaska also has NA under county. Oddly, the votes for FIPS 2000 and FIPS AK
    # are exactly the same, and there is no other county-level data for Alaska. I
    # deduced that the dataset simply groups all the counties of Alaska into one,
    # and decided to keep it in the county-level data anyway, since there is no
    # other county-level data on Alaska.
    election_federal = election.raw %>% filter(fips=="US")
    election_state = election.raw %>% filter(fips!="US", fips==as.character(state))
    election = election.raw %>% filter(!is.na(county) | fips=="46102" | fips=="2000")
  2. How many named presidential candidates were there in the 2016 election? Draw a bar chart of all votes received by each candidate.

    nrow(election_federal)
    ## [1] 32
    # There were 32 presidential candidates in the 2016 election.
    ggplot(data=election_federal, aes(candidate, votes)) +
      geom_bar(stat="identity", aes(fill=candidate)) +
      ggtitle("Votes Received by Each Candidate") +
      theme(axis.text.x=element_blank(),
            axis.ticks.x=element_blank())

  3. Create variables county_winner and state_winner by taking the candidate with the highest proportion of votes. Hint: to create county_winner, start with election, group by fips, compute total votes, and pct = votes/total. Then choose the highest row using top_n (variable state_winner is similar).

    election.percentages = election %>% group_by(fips) %>% 
      mutate(total=sum(votes), percent = votes/total)
    county_winner = top_n(election.percentages, 1)
    ## Selecting by percent
    election_state.percentages = election_state %>% group_by(fips) %>% 
      mutate(total=sum(votes), percent = votes/total)
    state_winner = top_n(election_state.percentages, 1)
    ## Selecting by percent

Visualization

Visualization is crucial for gaining insight and intuition during data mining. We will overlay our data on maps.

The R package ggplot2 can be used to draw maps. Consider the following code.

states = map_data("state")

ggplot(data = states) + 
  geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") + 
  coord_fixed(1.3) +
  guides(fill=FALSE)  # color legend is unnecessary and takes too long

The variable states contains the information needed to draw white-outlined polygons, with fill colors determined by region.

  1. Draw a county-level map by creating counties = map_data("county"). Color by county.

    counties = map_data("county")
    
    ggplot(data = counties) + 
      geom_polygon(aes(x = long, y = lat, fill = subregion, group = group), color = "white") + 
      coord_fixed(1.3) +
      guides(fill=FALSE)  # color legend is unnecessary and takes too long

  2. Now color the map by the winning candidate for each state. First, combine the states variable and the state_winner we created earlier using left_join(). Note that left_join() needs matching values in both tables to join them; however, they are in different formats: e.g. AZ vs. arizona. Before using left_join(), create a common column by adding a new column to states named fips = state.abb[match(some_column, some_function(state.name))]. Replace some_column and some_function to complete creation of this new column. Then left_join(). Your figure will look similar to the state-level New York Times map.

    states = states %>% mutate(fips = state.abb[match(region, tolower(state.name))])
    states = left_join(states, state_winner)
    ## Joining, by = "fips"
    ## Warning: Column `fips` joining character vector and factor, coercing into
    ## character vector
    ggplot(data = states) + 
      geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), color = "white") + 
      coord_fixed(1.3)

  3. The counties data does not have a fips column, so we will create one by pooling information from maps::county.fips. Split the polyname column into region and subregion. Use left_join() to combine county.fips with counties, then left_join() the previously created county_winner. Your figure will look similar to the county-level New York Times map.

    my.county.fips = county.fips %>% separate(polyname, c("region", "subregion"), ",")
    my.county.fips = my.county.fips %>% mutate(fips=as.factor(fips))
    my.county.fips = left_join(my.county.fips, counties)
    ## Joining, by = c("region", "subregion")
    my.county.fips = left_join(my.county.fips, county_winner)
    ## Joining, by = "fips"
    ## Warning: Column `fips` joining factors with different levels, coercing to
    ## character vector
    ggplot(data = my.county.fips) + 
      geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), color = "white") + 
      coord_fixed(1.3)

  4. Create a visualization of your choice using census data. Many exit polls noted that demographics played a big role in the election. Use this Washington Post article and this R graph gallery for ideas and inspiration.

    county_population = census %>%
      mutate(region = tolower(State),
             subregion = tolower(County)) %>% 
      group_by(region, subregion) %>% 
      summarise_at(vars(TotalPop),funs(sum(.)))
    county_population = left_join(counties, county_population)
    ## Joining, by = c("region", "subregion")
    ggplot(data = county_population) + 
      geom_polygon(aes(x = long, y = lat, fill = TotalPop, group = group), color = "white", size=0.05) + 
      coord_fixed(1.3) + 
      scale_fill_gradient(trans = "log10") +
      ggtitle("Population Densities by County")

  5. The census data contains high resolution information (more fine-grained than county-level).
    In this problem, we aggregate the information into county-level data by computing the TotalPop-weighted average of each attribute for each county. Create the following variables:

    • Clean census data census.del: start with census, filter out any rows with missing values, convert the {Men, Employed, Citizen} attributes to percentages (the metadata seems to be inaccurate), compute a Minority attribute by combining {Hispanic, Black, Native, Asian, Pacific}, and remove {Walk, PublicWork, Construction}.
      Many columns are closely related; when a set of percentage columns sums to 100%, one column of the set is dropped.

    • Sub-county census data, census.subct: start with census.del from above, group_by() two attributes {State, County}, use add_tally() to compute CountyTotal. Also, compute the weight by TotalPop/CountyTotal.

    • County census data, census.ct: start with census.subct, use summarize_at() to compute the weighted average

    • Print a few rows of census.ct:

    census.del = na.omit(census) %>% 
      mutate(Men=Men/TotalPop, 
             Employed=Employed/TotalPop,
             Citizen=Citizen/TotalPop,
             Minority=Hispanic+Black+Native+Asian+Pacific) %>%
      select(-Hispanic, -Black, -Native, -Asian, -Pacific,
             -Walk, -PublicWork, -Construction, -Women)
    
    census.subct = 
      census.del %>% 
      group_by(State, County) %>%
      add_tally(wt=TotalPop) %>%          # adds a count column named n
      rename(CountyTotal=n) %>%           # rename by name, not position
      mutate(Weight=TotalPop/CountyTotal)
    
    census.ct = census.subct %>%
      select(-CensusTract) %>%
      summarise_all(funs(weighted.mean(., w=Weight)))
    
    kable(census.ct %>% head)
    State County TotalPop Men White Citizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office Production Drive Carpool Transit OtherTransp WorkAtHome MeanCommute Employed PrivateWork SelfEmployed FamilyWork Unemployment Minority CountyTotal Weight
    Alabama Autauga 6486.404 0.4843266 75.78823 0.7374912 51696.29 7771.009 24974.50 3433.674 12.91231 18.70758 32.79097 17.17044 24.28243 17.15713 87.50624 8.781235 0.0952590 1.3059687 1.8356531 26.50016 0.4343637 73.73649 5.433254 0.0000000 7.733726 22.53687 55221 0.1174626
    Alabama Baldwin 7698.957 0.4884866 83.10262 0.7569406 51074.36 8745.050 27316.84 3803.718 13.42423 19.48431 32.72994 17.95092 27.10439 11.32186 84.59861 8.959078 0.1266209 1.4438000 3.8504774 26.32218 0.4405113 81.28266 5.909353 0.3633269 7.589820 15.21426 195121 0.0394573
    Alabama Barbour 3325.195 0.5382816 46.23159 0.7691222 32959.30 6031.065 16824.22 2430.189 26.50563 43.55962 26.12404 16.46343 23.27878 23.31741 83.33021 11.056609 0.4954032 1.6217251 1.5019456 24.51828 0.3192113 71.59426 7.149837 0.0897742 17.525557 51.94382 26932 0.1234663
    Alabama Bibb 6380.718 0.5341090 74.49989 0.7739781 38886.63 5662.358 18430.99 3073.599 16.60375 27.19708 21.59010 17.95545 17.46731 23.74415 83.43488 13.153641 0.5031366 1.5620952 0.7314679 28.71439 0.3669262 76.74385 6.637936 0.3941515 8.163104 24.16597 22604 0.2822827
    Alabama Blount 7018.573 0.4940565 87.85385 0.7337550 46237.97 8695.786 20532.27 2052.055 16.72152 26.85738 28.52930 13.94252 23.83692 20.10413 84.85031 11.279222 0.3626321 0.4199411 2.2654133 34.84489 0.3844914 81.82671 4.228716 0.3564928 7.699640 10.59474 57710 0.1216180
    Alabama Bullock 4263.211 0.5300618 22.19918 0.7545420 33292.69 9000.345 17579.57 3110.645 24.50260 37.29116 19.55253 14.92420 20.17051 25.73547 74.77277 14.839127 0.7732160 1.8238247 3.0998783 28.63106 0.3619592 79.09065 5.273684 0.0000000 17.890026 76.53587 10678 0.3992518
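The TotalPop-weighted averaging used above can be sanity-checked on a toy example: two hypothetical tracts in one county with populations 100 and 300, and a rate attribute of 10% and 20%. The county value should be the population-weighted mean.

```r
tract_pop  <- c(100, 300)
tract_rate <- c(10, 20)  # e.g. a percentage column like Poverty
county_rate <- weighted.mean(tract_rate, tract_pop / sum(tract_pop))
county_rate  # 17.5, pulled toward the larger tract
```

This matches the Weight = TotalPop/CountyTotal construction: the weights are the tracts' population shares within the county.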

Dimensionality reduction

  1. Run PCA for both county & sub-county level data. Save the principal components data frames, call them ct.pc and subct.pc, respectively. What are the most prominent loadings of the first two principal components PC1 and PC2?

    census.ct = census.ct %>% ungroup
    census.subct = census.subct %>% ungroup
    tmp.ct = census.ct %>%
      select(-State, -County, -CountyTotal, -Weight)
    tmp.subct = census.subct %>%
      select(-CensusTract, -State, -County, -CountyTotal, -Weight)
    
    # We should scale because variances of the different features will be
    # vastly different.
    ct.pc = prcomp(tmp.ct,
                   center=TRUE,
                   scale=TRUE)
    subct.pc = prcomp(tmp.subct,
                      center=TRUE,
                      scale=TRUE)
    
    biplot(ct.pc)

    biplot(subct.pc)

    # At a glance, it appears that the Income and Minority loadings among others
    # have a relatively large magnitude on PC1 and PC2.
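Rather than eyeballing the biplot, the largest-magnitude loadings can be pulled out programmatically. Below is a small helper with a hypothetical name (top_loadings), demonstrated on the built-in mtcars data since ct.pc itself requires the project data; the same call would work on ct.pc.

```r
# Return the n largest-magnitude loadings of one principal component
top_loadings <- function(pc, comp = 1, n = 3) {
  rot <- pc$rotation[, comp]
  rot[order(abs(rot), decreasing = TRUE)][1:n]
}

pc <- prcomp(mtcars, center = TRUE, scale. = TRUE)
top_loadings(pc, comp = 1, n = 3)  # named vector of the top-3 loadings
```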

Clustering

  1. With census.ct, perform hierarchical clustering using the Euclidean distance metric with complete linkage to find 10 clusters. Repeat the clustering process with the first 5 principal components of ct.pc. Compare and contrast the clusters containing San Mateo County. Can you hypothesize why this would be the case?

    dist.ct = dist(ct.pc$x)
    hc.ct = hclust(dist.ct, method="complete")
    
    ct.pc5 = ct.pc$x[,1:5]
    dist.ct5 = dist(ct.pc5)
    hc.ct5 = hclust(dist.ct5, method="complete")
    col_vector = c("#000000", "#FF0000", "#608E46", "#F3DD00", "#87FF00",
                   "#00FFD1", "#0093FF", "#0013FF", "#C500FF", "#FF008B")
    
    # Plot all PCs
    cols_t1<-col_vector[cutree(hc.ct, k=10)]
    plot(ct.pc$x, col=cols_t1)
    legend(x=-10, y=10, legend=c("Trump", "Clinton"),
           col=c("red", "blue"), lty=1, cex=0.8)

    # Plot first 5 PCs
    cols_t2<-col_vector[cutree(hc.ct5, k=10)]
    plot(ct.pc5, col=cols_t2)
    legend(x=-10, y=10, legend=c("Trump", "Clinton"),
           col=c("red", "black"), lty=1, cex=0.8)

    summary(ct.pc)
    ## Importance of components:
    ##                           PC1    PC2    PC3     PC4     PC5     PC6
    ## Standard deviation     2.5836 1.9248 1.8166 1.30618 1.09784 1.07422
    ## Proportion of Variance 0.2567 0.1425 0.1269 0.06562 0.04636 0.04438
    ## Cumulative Proportion  0.2567 0.3992 0.5262 0.59177 0.63812 0.68251
    ##                            PC7     PC8     PC9    PC10    PC11    PC12
    ## Standard deviation     1.07053 0.99209 0.93460 0.88405 0.86656 0.78071
    ## Proportion of Variance 0.04408 0.03786 0.03359 0.03006 0.02888 0.02344
    ## Cumulative Proportion  0.72658 0.76444 0.79804 0.82810 0.85698 0.88042
    ##                           PC13    PC14    PC15    PC16    PC17    PC18
    ## Standard deviation     0.70711 0.69387 0.62992 0.59578 0.56022 0.54534
    ## Proportion of Variance 0.01923 0.01852 0.01526 0.01365 0.01207 0.01144
    ## Cumulative Proportion  0.89965 0.91817 0.93343 0.94708 0.95915 0.97059
    ##                          PC19    PC20   PC21    PC22    PC23    PC24
    ## Standard deviation     0.4646 0.42231 0.3678 0.31065 0.24869 0.21151
    ## Proportion of Variance 0.0083 0.00686 0.0052 0.00371 0.00238 0.00172
    ## Cumulative Proportion  0.9789 0.98575 0.9910 0.99467 0.99705 0.99877
    ##                           PC25    PC26
    ## Standard deviation     0.17221 0.04921
    ## Proportion of Variance 0.00114 0.00009
    ## Cumulative Proportion  0.99991 1.00000
    # I was unable to isolate San Mateo County specifically, but I imagine
    # that by using fewer principal components, we may have put San Mateo County
    # in the wrong cluster. We can see from the summary that 5 PCs account for
    # only about 64% of the variance. We lost a lot of information, so finding
    # San Mateo or any other county in a different cluster than when using all
    # PCs would not be surprising.
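The San Mateo comparison can be done programmatically with cutree(): look up the county's row index, then read off its cluster label under each cut. Since census.ct requires the project files, the pattern is sketched here on the built-in USArrests data with one named row standing in for San Mateo County.

```r
pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)
hc.full  <- hclust(dist(pc$x),        method = "complete")  # all PCs
hc.trunc <- hclust(dist(pc$x[, 1:2]), method = "complete")  # truncated

# Locate one named observation and compare its cluster under each cut
idx <- which(rownames(USArrests) == "California")
c(full = cutree(hc.full, k = 4)[idx], trunc = cutree(hc.trunc, k = 4)[idx])
```

On the project data, `which(census.ct$County == "San Mateo")` plays the role of `idx`, with k = 10.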

Classification

In order to train classification models, we need to combine the county_winner and census.ct data. This seemingly straightforward task is harder than it sounds. The following code makes the necessary changes to merge them into election.cl for classification.

tmpwinner = county_winner %>% ungroup %>%
  mutate(state = state.name[match(state, state.abb)]) %>%               ## state abbreviations
  mutate_at(vars(state, county), tolower) %>%                           ## to all lowercase
  mutate(county = gsub(" county| columbia| city| parish", "", county))  ## remove suffixes
tmpcensus = census.ct %>% mutate_at(vars(State, County), tolower)

election.cl = tmpwinner %>%
  left_join(tmpcensus, by = c("state"="State", "county"="County")) %>% 
  na.omit

## saves meta information to attributes
attr(election.cl, "location") = election.cl %>% select(c(county, fips, state, votes, percent))
election.cl = election.cl %>% select(-c(county, fips, state, votes, percent, total, CountyTotal, Weight))

Using the following code, partition data into 80% training and 20% testing:

set.seed(10) 
n = nrow(election.cl)
in.trn= sample.int(n, 0.8*n) 
trn.cl = election.cl[ in.trn,]
tst.cl = election.cl[-in.trn,]

Using the following code, define 10 cross-validation folds:

set.seed(20) 
nfold = 10
folds = sample(cut(1:nrow(trn.cl), breaks=nfold, labels=FALSE))
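The fold-assignment pattern above can be sanity-checked on a toy length: cut() splits the index 1..n into nfold equal-width groups, and sample() shuffles which observation gets which fold label, leaving the fold sizes balanced.

```r
f <- sample(cut(1:20, breaks = 4, labels = FALSE))
table(f)  # five observations per fold
```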

Using the following error rate function:

calc_error_rate = function(predicted.value, true.value){
  return(mean(true.value!=predicted.value))
}
records = matrix(NA, nrow=3, ncol=2)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","knn","glm")
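A quick check of calc_error_rate on a toy vector (the function is restated so the snippet is self-contained): one of four predictions disagrees with the truth, so the error rate is 0.25.

```r
calc_error_rate = function(predicted.value, true.value){
  return(mean(true.value!=predicted.value))
}
calc_error_rate(c("T", "C", "T", "T"), c("T", "C", "C", "T"))  # 0.25
```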

Classification: native attributes

  1. Decision tree: train a decision tree by cv.tree(). Prune tree to minimize misclassification. Be sure to use the folds from above for cross-validation. Visualize the trees before and after pruning. Save training and test errors to records variable.

    election_tree = tree(candidate~.,
                data = trn.cl,
                control = tree.control(nobs = nrow(trn.cl), minsize = 6, mindev = 1e-6))
    
    draw.tree(election_tree, cex=0.6, nodeinfo=TRUE)

    cv = cv.tree(election_tree, rand=folds, FUN=prune.misclass, K=nfold)
    best.size.cv = min(cv$size[which(cv$dev == min(cv$dev))])
    
    election_tree.pruned = prune.tree(election_tree, best = best.size.cv)
    draw.tree(election_tree.pruned, cex = 0.6, nodeinfo=TRUE)

    train.pred = predict(election_tree.pruned, trn.cl[,-1], type = 'class')
    test.pred = predict(election_tree.pruned, tst.cl[,-1], type = 'class')
    
    error.training.dt <- calc_error_rate(train.pred, trn.cl$candidate)
    error.testing.dt <- calc_error_rate(test.pred, tst.cl$candidate)
    
    records[1,1] <- error.training.dt
    records[1,2] <- error.testing.dt
    records
    ##      train.error test.error
    ## tree  0.07084691 0.08469055
    ## knn           NA         NA
    ## glm           NA         NA
  2. K-nearest neighbor: train a KNN model for classification. Use cross-validation to determine the best number of neighbors, and plot number of neighbors vs. resulting training and validation errors. Compute test error and save to records.

    # do.chunk() for k-fold Cross-validation
    
    do.chunk <- function(chunkid, folddef, Xdat, Ydat, k){ # Function arguments
    
      train = (folddef!=chunkid) # Get training index
    
      Xtr = Xdat[train,] # Get training set by the above index
      Ytr = Ydat[train] # Get true labels in training set
    
      Xvl = Xdat[!train,] # Get validation set
      Yvl = Ydat[!train] # Get true labels in validation set
    
      predYtr = knn(train=Xtr, test=Xtr, cl=Ytr, k = k) # Predict training labels
      predYvl = knn(train=Xtr, test=Xvl, cl=Ytr, k = k) # Predict validation labels
    
      data.frame(train.error = mean(predYtr != Ytr), # Training error for each fold
                 val.error = mean(predYvl != Yvl)) # Validation error for each fold
    
    }
    # Set error.folds (a vector) to save validation errors in future
    error.folds = NULL
    # Give possible number of nearest neighbours to be considered
    allK = c(1, seq(10,50,length.out=9))
    # Set seed since do.chunk() contains a random component induced by knn()
    set.seed(888)
    # Loop through different number of neighbors
    for (j in allK){
      # Cross-validate on the training set only: folds was defined on trn.cl,
      # so using election.cl here would misalign the fold indices and leak
      # test data into model selection
      tmp = plyr::ldply(1:nfold, do.chunk, # Apply do.chunk() function to each fold
                        folddef=folds, Xdat=trn.cl[,-1], Ydat=trn.cl$candidate, k=j)
      # Necessary arguments to be passed into do.chunk
      tmp$neighbors = j # Keep track of each value of neighbors
      error.folds = rbind(error.folds, tmp) # combine results
    }
    # Transform the format of error.folds for further convenience
    errors = reshape2::melt(error.folds, id.vars=c('neighbors'), value.name='error')
    # Choose the number of neighbors which minimizes validation error
    val.error.means = errors %>%
      # Select all rows of validation errors
      filter(variable=='val.error') %>%
      # Group the selected data frame by neighbors
      group_by(neighbors, variable) %>%
      # Calculate CV error rate for each k
      summarise_all(funs(mean)) %>%
      # Remove existing group
      ungroup() %>%
      filter(error==min(error))
    
    # Best number of neighbors
    # if there is a tie, pick larger number of neighbors for simpler model
    numneighbor = max(val.error.means$neighbors)
    numneighbor
    ## [1] 20
    train.pred <- knn(train = trn.cl[,-1], test = trn.cl[,-1], cl = trn.cl$candidate, k = numneighbor)
    test.pred <- knn(train = trn.cl[,-1], test = tst.cl[,-1], cl = trn.cl$candidate, k = numneighbor)
    
    error.training.dt <- calc_error_rate(train.pred, trn.cl$candidate)
    error.testing.dt <- calc_error_rate(test.pred, tst.cl$candidate)
    
    records[2,1] <- error.training.dt
    records[2,2] <- error.testing.dt
    records
    ##      train.error test.error
    ## tree  0.07084691 0.08469055
    ## knn   0.13070033 0.15146580
    ## glm           NA         NA

Classification: principal components

Instead of using the native attributes, we can use principal components to train our classification models. After this section, we will compare classification performance using native attributes versus principal components.

pca.records = matrix(NA, nrow=3, ncol=2)
colnames(pca.records) = c("train.error","test.error")
rownames(pca.records) = c("tree","knn","glm")
  1. Compute principal components from the independent variables in the training data. Then, determine the minimum number of PCs needed to capture 90% of the variance. Plot the proportion of variance explained.

    # Scale since variances of each feature are of vastly different magnitudes
    election.pca = prcomp(trn.cl[,-1], scale=TRUE)
    summary(election.pca)
    ## Importance of components:
    ##                          PC1    PC2    PC3    PC4     PC5    PC6     PC7
    ## Standard deviation     2.509 1.9833 1.8322 1.2490 1.14904 1.1007 1.05190
    ## Proportion of Variance 0.242 0.1513 0.1291 0.0600 0.05078 0.0466 0.04256
    ## Cumulative Proportion  0.242 0.3933 0.5224 0.5824 0.63322 0.6798 0.72237
    ##                            PC8     PC9    PC10   PC11    PC12    PC13
    ## Standard deviation     0.98043 0.92586 0.91111 0.8624 0.78449 0.72848
    ## Proportion of Variance 0.03697 0.03297 0.03193 0.0286 0.02367 0.02041
    ## Cumulative Proportion  0.75934 0.79231 0.82424 0.8528 0.87652 0.89693
    ##                           PC14    PC15    PC16    PC17    PC18    PC19
    ## Standard deviation     0.69622 0.64011 0.61214 0.59291 0.54889 0.46333
    ## Proportion of Variance 0.01864 0.01576 0.01441 0.01352 0.01159 0.00826
    ## Cumulative Proportion  0.91557 0.93133 0.94574 0.95926 0.97085 0.97911
    ##                           PC20    PC21   PC22    PC23    PC24    PC25
    ## Standard deviation     0.42022 0.38365 0.3101 0.23525 0.18595 0.17459
    ## Proportion of Variance 0.00679 0.00566 0.0037 0.00213 0.00133 0.00117
    ## Cumulative Proportion  0.98590 0.99156 0.9953 0.99739 0.99872 0.99989
    ##                           PC26
    ## Standard deviation     0.05359
    ## Proportion of Variance 0.00011
    ## Cumulative Proportion  1.00000
    # We can see that it takes 14 PCs to explain 90% of the variance
    
    election.pca.out = summary(election.pca)
    election.pca.pve = election.pca.out$importance[2,]
    plot(election.pca.pve, xlab="Principal Component", 
         ylab="Variance Explained", main="Proportion of Variance Explained",
         ylim=c(0,1), type='b')

    election.pca.cve = election.pca.out$importance[3,]
    plot(election.pca.cve, xlab="Principal Component", 
         ylab="Variance Explained", main="Cumulative Proportion of Variance Explained",
         ylim=c(0,1), type='b')
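The 90% threshold can also be read off programmatically: compute the per-component PVE from the standard deviations and find the first index where the cumulative sum crosses 0.90. Shown on the built-in mtcars data, since election.pca depends on the project data; the same two lines apply to any prcomp fit.

```r
pc  <- prcomp(mtcars, center = TRUE, scale. = TRUE)
pve <- pc$sdev^2 / sum(pc$sdev^2)      # proportion of variance explained
k   <- which(cumsum(pve) >= 0.90)[1]   # minimum number of PCs for 90%
k
```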

  2. Create new training data by taking the class labels and principal components. Call this variable tr.pca. Create the test data based on the principal component loadings: i.e., transform the independent variables in the test data into the principal component space. Call this variable test.pca.

    tr.pca = data.frame(trn.cl$candidate, election.pca$x[,1:14])
    colnames(tr.pca)[1]="candidate"
    
    # Project the test data into the training PC space using the training
    # loadings; refitting PCA on the test set would give a different,
    # incomparable basis
    test.pca = data.frame(tst.cl$candidate, predict(election.pca, tst.cl[,-1])[,1:14])
    colnames(test.pca)[1]="candidate"
  3. Decision tree: repeat training of decision tree models using principal components as independent variables. Record resulting errors.

    pca_tree = tree(candidate~.,
                data = tr.pca,
                control = tree.control(nobs = nrow(trn.cl), minsize = 6, mindev = 1e-6))
    
    draw.tree(pca_tree, cex=0.6, nodeinfo=TRUE)

    cv = cv.tree(pca_tree, rand=folds, FUN=prune.misclass, K=nfold)
    pca.size.cv = min(cv$size[which(cv$dev == min(cv$dev))])
    
    pca_tree.pruned = prune.tree(pca_tree, best = pca.size.cv)
    draw.tree(pca_tree.pruned, cex = 0.6, nodeinfo=TRUE)

    train.pred = predict(pca_tree.pruned, tr.pca[,-1], type = 'class')
    test.pred = predict(pca_tree.pruned, test.pca[,-1], type = 'class')
    
    error.training.dt <- calc_error_rate(train.pred, tr.pca$candidate)
    error.testing.dt <- calc_error_rate(test.pred, test.pca$candidate)
    
    pca.records[1,1] <- error.training.dt
    pca.records[1,2] <- error.testing.dt
    pca.records
    ##      train.error test.error
    ## tree  0.04723127  0.2247557
    ## knn           NA         NA
    ## glm           NA         NA
  4. K-nearest neighbor: repeat training of KNN classifier using principal components as independent variables. Record resulting errors.

    # Set error.folds (a vector) to save validation errors in future
    pca.error.folds = NULL
    # Set seed since do.chunk() contains a random component induced by knn()
    set.seed(888)
    # Loop through different number of neighbors
    for (j in allK){
      tmp = plyr::ldply(1:nfold, do.chunk, # Apply do.chunk() function to each fold
                        folddef=folds, Xdat=tr.pca[,-1], Ydat=tr.pca$candidate, k=j)
      # Necessary arguments to be passed into do.chunk
      tmp$neighbors = j # Keep track of each value of neighbors
      pca.error.folds = rbind(pca.error.folds, tmp) # combine results
    }
    # Transform the format of pca.error.folds for further convenience
    # (note: melt pca.error.folds, not the error.folds from the earlier KNN run)
    errors = reshape2::melt(pca.error.folds, id.vars=c('neighbors'), value.name='error')
    # Choose the number of neighbors which minimizes validation error
    val.error.means = errors %>%
      # Select all rows of validation errors
      filter(variable=='val.error') %>%
      # Group the selected data frame by neighbors
      group_by(neighbors, variable) %>%
      # Calculate CV error rate for each k
      summarise_all(funs(mean)) %>%
      # Remove existing group
      ungroup() %>%
      filter(error==min(error))
    
    # Best number of neighbors
    # if there is a tie, pick larger number of neighbors for simpler model
    numneighbor = max(val.error.means$neighbors)
    numneighbor
    ## [1] 20
    train.pred <- knn(train = tr.pca[,-1], test = tr.pca[,-1], cl = tr.pca$candidate, k = numneighbor)
    test.pred <- knn(train = tr.pca[,-1], test = test.pca[,-1], cl = tr.pca$candidate, k = numneighbor)
    
    error.training.knn <- calc_error_rate(train.pred, tr.pca$candidate)
    error.testing.knn <- calc_error_rate(test.pred, test.pca$candidate)
    
    pca.records[2,1] <- error.training.knn
    pca.records[2,2] <- error.testing.knn
    pca.records
    ##      train.error test.error
    ## tree  0.04723127  0.2247557
    ## knn   0.06840391  0.2231270
    ## glm           NA         NA

Interpretation & Discussion

  1. This is an open question. Interpret and discuss any insights gained and possible explanations. Use any tools at your disposal to make your case: visualize errors on the map, discuss what does or doesn't seem reasonable based on your understanding of these methods, and propose possible directions (collecting additional data, using domain knowledge, etc.).

    • With clustering, we predicted how certain demographics might vote for a certain candidate. Using all PCs, we got a more accurate result, with most predictions falling under Trump or Clinton. When we reduced the number of PCs to 5, predictions started to be misclassified, and more points were assigned to clusters for third-party candidates. This seems reasonable: by reducing the dimensions, we remove predictor variables that could be helpful in determining the right candidate for each cluster. We then performed classification with decision trees and KNN, and saw that decision trees outperformed KNN on the county-level data both with and without PCA. KNN likely struggles here because we have so many features: under the curse of dimensionality, the distance between two arbitrary observations can be large even when the observations are actually very similar. The training error in particular shows how reducing the dimensions improves KNN. Overall, PCA does worse than non-PCA, as expected, since we are effectively predicting class labels using less information.
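The curse-of-dimensionality point can be seen in a quick, self-contained simulation (not using the election data; `relative_contrast` is an illustrative helper of our own): as the number of features grows, pairwise distances between random points concentrate, so the nearest and farthest neighbors become nearly indistinguishable, which is exactly what hurts KNN.

```r
# Relative contrast of pairwise distances among n random points in p dimensions:
# (max distance - min distance) / mean distance. Small values mean "everything
# is roughly equally far away," which degrades nearest-neighbor methods.
set.seed(1)
relative_contrast <- function(p, n = 100) {
  x <- matrix(runif(n * p), nrow = n)  # n points uniform in the unit p-cube
  d <- as.vector(dist(x))              # all pairwise Euclidean distances
  (max(d) - min(d)) / mean(d)
}
sapply(c(2, 10, 100, 1000), relative_contrast)  # shrinks as p grows
```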

Taking it further

  1. Propose and tackle at least one interesting question. Be creative! Some possibilities are:

    • Data preprocessing: we aggregated sub-county level data before performing classification. Would classification at the sub-county level before determining the winner perform better? What implicit assumptions are we making?

    • Feature engineering: would a non-linear classification method perform better? Would you use native features or principal components?

    • Additional classification methods: logistic regression, LDA, QDA, SVM, random forest, etc. (You may use methods beyond this course). How do these compare to KNN and tree method?

    • Bootstrap: perform the bootstrap to generate plots similar to Figure 4.10/4.11. Discuss the results.
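As a hedged sketch of the bootstrap bullet above (on the built-in mtcars data rather than the election data; `coef.fn` and `boot.out` are names of our own choosing), the boot package can resample rows, refit a logistic regression on each resample, and summarize the sampling distribution of a coefficient:

```r
library(boot)  # recommended package shipped with R

# Statistic to bootstrap: the wt coefficient from a toy logistic regression
coef.fn <- function(data, idx) {
  fit <- glm(am ~ wt, family = binomial('logit'), data = data[idx, ])
  coef(fit)["wt"]
}

set.seed(888)
boot.out <- boot(mtcars, coef.fn, R = 200)
boot.out        # original estimate, bootstrap bias, and standard error
sd(boot.out$t)  # bootstrap SE of the wt coefficient
```

Plotting the bootstrap replicates (e.g. `plot(boot.out)`) gives a picture of the coefficient's sampling variability analogous to the textbook figures.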

    # We can try glm, but first we need a binary class setting. We currently
    # have a multiclass problem: at the county level only Trump and Clinton
    # were winners, but the candidate factor still carries the unused levels.
    
    candidate = droplevels(election.cl)$candidate
    election.cl.new = data.frame(candidate, election.cl %>% select(-candidate))
    
    set.seed(10) 
    n = nrow(election.cl.new)
    in.trn= sample.int(n, 0.8*n) 
    trn.cl.new = election.cl.new[ in.trn,]
    tst.cl.new = election.cl.new[-in.trn,]
    
    election.glm <- glm(candidate~., family = binomial('logit'), data = trn.cl.new)
    ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    pred.glm.train <- predict(election.glm, trn.cl.new, type = 'response')
    pred.glm.test <- predict(election.glm, tst.cl.new, type = 'response')
    
    # glm() models the probability of the second factor level; alphabetically
    # that is 'Hillary Clinton', so p > 0.5 maps to Clinton
    cl.glm.train <- ifelse(pred.glm.train > 0.5, 'Hillary Clinton', 'Donald Trump')
    cl.glm.test <- ifelse(pred.glm.test > 0.5, 'Hillary Clinton', 'Donald Trump')
    
    error.glm.train <- calc_error_rate(cl.glm.train, trn.cl.new$candidate)
    error.glm.test <- calc_error_rate(cl.glm.test, tst.cl.new$candidate)
    
    records[3,1] <- error.glm.train
    records[3,2] <- error.glm.test
    records
    ##      train.error test.error
    ## tree  0.07084691 0.08469055
    ## knn   0.13070033 0.15146580
    ## glm   0.06473941 0.07491857
    # Compared to the decision tree and KNN, logistic regression does the best.
    # With KNN doing worst, our models seem to suggest that the true decision
    # boundaries are closer to linear than nonlinear.
    election.pca.new = prcomp(trn.cl.new[,-1], scale=TRUE)
    
    tr.pca.new = data.frame(trn.cl.new$candidate, election.pca.new$x[,1:14])
    colnames(tr.pca.new)[1]="candidate"
    
    # Caveat: this refits PCA on the test set itself, so the test scores are
    # not expressed in the same PC basis as the training scores
    election.test.pca.new = prcomp(tst.cl.new[,-1], scale=TRUE)
    test.pca.new = data.frame(tst.cl.new$candidate, election.test.pca.new$x[,1:14])
    colnames(test.pca.new)[1]="candidate"
    
    election.glm <- glm(candidate~., family = binomial('logit'), data = tr.pca.new)
    ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    pred.glm.train <- predict(election.glm, tr.pca.new, type = 'response')
    pred.glm.test <- predict(election.glm, test.pca.new, type = 'response')
    
    cl.glm.train <- ifelse(pred.glm.train > 0.5, 'Hillary Clinton', 'Donald Trump')
    cl.glm.test <- ifelse(pred.glm.test > 0.5, 'Hillary Clinton', 'Donald Trump')
    
    error.glm.train <- calc_error_rate(cl.glm.train, tr.pca.new$candidate)
    error.glm.test <- calc_error_rate(cl.glm.test, test.pca.new$candidate)
    
    pca.records[3,1] <- error.glm.train
    pca.records[3,2] <- error.glm.test
    pca.records
    ##      train.error test.error
    ## tree  0.04723127  0.2247557
    ## knn   0.06840391  0.2231270
    ## glm   0.08021173  0.3387622
    # Under PCA, logistic regression performs the worst of the three.
    # It's not surprising that the county-level (non-PCA) logistic regression
    # beats the PCA version, but I do find it surprising that it went from the
    # best classifier to the worst. It seems to imply that logistic regression
    # needs more of the original information to build a good model.
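One caveat with the chunk above is that prcomp() is refit on the test set, so the test scores live in a different basis than the training scores the model was fit on, which may account for part of the PCA model's poor test error. A minimal sketch of the more standard approach, using iris as a stand-in dataset, projects the test rows onto the training components with predict():

```r
trn <- iris[seq(1, 150, by = 2), 1:4]  # odd rows as "training"
tst <- iris[seq(2, 150, by = 2), 1:4]  # even rows as "test"

pc <- prcomp(trn, scale = TRUE)  # PCA fit on training data only
trn.scores <- pc$x               # training scores
tst.scores <- predict(pc, tst)   # test rows centered/scaled with the training
                                 # parameters, then rotated onto training PCs
dim(tst.scores)                  # same columns as the training scores
```

With this projection, the logistic regression would score the test set in the same coordinate system it was trained in, making the train/test comparison apples-to-apples.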